Abstract: This report considers how to inject external candidate solutions
into the CMA-ES algorithm. The injected solutions might stem from a gradient
or a Newton step, a surrogate model optimizer or any other oracle or search
mechanism. They can also be the result of a repair mechanism, for example to
render infeasible solutions feasible. Only small modifications to the CMA-ES
are necessary to turn injection into a reliable and effective method: overly
long steps need to be tightly renormalized. The main objective of this report
is to reveal this simple mechanism.

Depending on the source of the injected solutions, interesting variants of
CMA-ES arise. When the best-ever solution is always (re-)injected, an elitist
variant of CMA-ES with weighted multi-recombination arises. When all solutions
are injected from an external source, the resulting algorithm might be
viewed as adaptive encoding with step-size control.

In first experiments, injected solutions of very good quality lead to a
convergence speed twice as fast as on the (simple) sphere function without
injection. This means that we observe an impressive speed-up on functions that
are otherwise difficult to solve. Single bad injected solutions, on the other
hand, do no significant harm.

1 Introduction

The CMA-ES (Covariance Matrix Adaptation Evolution Strategy [4, o, 2]) is a 
stochastic search algorithm for non-convex continuous optimization in a
black-box setting, where we want to minimize the objective function (or fitness
function)

$$f : \mathbb{R}^n \to \mathbb{R}, \quad x \mapsto f(x)$$

without exploiting any a priori specified structure of $f$. The CMA-ES algorithm
maintains a multivariate normal sampling distribution for $x$ and updates the
distribution parameters with a comparatively sophisticated procedure, see
Figure 1. While the algorithm is quite robust to large irregularities in the
objective function $f$, even small changes of the update procedure can lead to a
dramatic breakdown of its performance. This property has been perceived as a
main weakness of the algorithm.

In this report we show how to make CMA-ES robust to (almost) arbitrary
changes of the solutions used in the update procedure. In other words, we reveal
the measures needed to properly inject external proposals for either candidate
solution points or directions into the CMA-ES algorithm by replacing some of the
internal solutions originally sampled by CMA-ES or, equivalently, by using
solutions that are modified in any desired way (for example to make them
feasible).

External or modified proposal solutions or directions can have a variety of
sources:

• a gradient or Newton direction; 

• an improved solution, for example the result of a local search started from
a solution sampled by CMA-ES (Lamarckian learning), which allows CMA-ES to be
used in the context of memetic algorithms;

• a repaired solution, for example from a previously infeasible solution; 

• an optimal solution of a surrogate model built from already evaluated 
solutions; 

• the best-ever solution seen so far; 

• proposals from any algorithm running in parallel to CMA-ES (migration).

Because injecting a single bad solution essentially corresponds to decreasing the
population size by one, no particular care needs to be taken that only
(exceptionally) good solutions are introduced. Any promising source of solutions
might be used. Within CMA-ES, solutions are sampled symmetrically and therefore
also virtually never lead to a systematic improvement before selection.

When all originally sampled internal solutions are replaced, the resulting
procedure resembles adaptive encoding [ ]. The main differences from adaptive
encoding are: (i) external solutions are represented in the original (phenotypic)
space, (ii) step-size control remains in place and (iii) the parameter setting is
different. Using a different (genotypic) representation to generate new external
solutions is the crucial idea of adaptive encoding and can also be employed here.

The modifications introduced in CMA-ES are small but will often be decisive. 
They are outlined in the next section. 

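Notations. Throughout this report, we use for
$E\|\mathcal{N}(0, I)\| = \sqrt{2}\,\Gamma\!\left(\frac{n+1}{2}\right) / \Gamma\!\left(\frac{n}{2}\right)$
the approximation $\sqrt{n}\,\left(1 - \frac{1}{4n} + \frac{1}{21 n^2}\right)$.
The notation $a \wedge bc + d$ denotes the minimum of $a$ and $bc + d$.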
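As a quick numerical check of this approximation, the following minimal sketch compares the exact expression with the approximation for a few dimensions. The function names are hypothetical and chosen for illustration only.

```python
import math

def expected_norm_exact(n: int) -> float:
    """E||N(0,I)|| = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2), computed via log-gamma."""
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))

def expected_norm_approx(n: int) -> float:
    """Approximation sqrt(n) * (1 - 1/(4n) + 1/(21 n^2))."""
    return math.sqrt(n) * (1.0 - 1.0 / (4 * n) + 1.0 / (21 * n * n))

for n in (1, 2, 10, 100):
    exact = expected_norm_exact(n)
    approx = expected_norm_approx(n)
    # print dimension, both values, and the relative error of the approximation
    print(n, exact, approx, abs(approx - exact) / exact)
```

The relative error is small already in low dimension and shrinks further as $n$ grows.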



2 Injection in the CMA-ES Algorithm 

The CMA-ES algorithm that tolerates injected solutions is displayed in Fig. 1.
New parts are highlighted with shaded background. Injected solutions replace
$x_i$ in (1). The decisive function $\alpha_{\mathrm{clip}}(\cdot, \cdot)$ used
in (3) and (6) reads

$$\alpha_{\mathrm{clip}}(c, x) = 1 \wedge \frac{c}{x}. \tag{12}$$

However, different choices for $\alpha_{\mathrm{clip}}(\cdot, \cdot)$ are
possible, or even desirable, and discussed below. With parameter setting
$c_y = c_y^m = \Delta_\sigma^{\max} = \infty$, the original CMA-ES is recovered
(in this case, Equations (3) and (6) are meaningless). An injected direction,
$v$, is used by setting

$$x_i = m^t + \sigma^t \, \frac{\sqrt{n}}{\|(C^t)^{-1/2} v\|} \, v. \tag{13}$$

If $v$ represents a gradient direction, using instead

$$x_i = m^t + \sigma^t \, \frac{\sqrt{n}}{\|(C^t)^{1/2} v\|} \, C^t v \tag{14}$$

seems to suggest itself. Remark that internal perturbations in CMA-ES follow
$(C^t)^{1/2} \mathcal{N}(0, I)$, where $\mathcal{N}(0, I)$ is isotropic and
$\|\mathcal{N}(0, I)\| \approx \sqrt{n}$.¹
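The two ways of turning a direction into an injected solution, (13) and (14), can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the matrix square root is taken as the symmetric root from an eigendecomposition, and for minimization one would presumably pass a descent direction (for example the negative gradient) as `v`, which is an assumption about the sign convention.

```python
import numpy as np

def sqrt_and_inv_sqrt(C):
    """Symmetric square root of C and its inverse via eigendecomposition."""
    eigvals, B = np.linalg.eigh(C)
    sqrt_C = (B * np.sqrt(eigvals)) @ B.T
    inv_sqrt_C = (B / np.sqrt(eigvals)) @ B.T
    return sqrt_C, inv_sqrt_C

def inject_direction(m, sigma, C, v):
    """Eq. (13): step along v, scaled to Mahalanobis length sqrt(n)."""
    _, inv_sqrt_C = sqrt_and_inv_sqrt(C)
    return m + sigma * np.sqrt(len(m)) / np.linalg.norm(inv_sqrt_C @ v) * v

def inject_gradient_direction(m, sigma, C, v):
    """Eq. (14): step along C v, again scaled to Mahalanobis length sqrt(n)."""
    sqrt_C, _ = sqrt_and_inv_sqrt(C)
    return m + sigma * np.sqrt(len(m)) / np.linalg.norm(sqrt_C @ v) * (C @ v)

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)          # an arbitrary positive definite matrix
m, sigma, v = np.zeros(n), 0.5, rng.standard_normal(n)
_, inv_sqrt_C = sqrt_and_inv_sqrt(C)
for x in (inject_direction(m, sigma, C, v), inject_gradient_direction(m, sigma, C, v)):
    # Mahalanobis length of the step in units of sigma, next to sqrt(n)
    print(np.linalg.norm(inv_sqrt_C @ (x - m)) / sigma, np.sqrt(n))
```

In both cases the resulting step has the length $\sqrt{n}$ in the Mahalanobis metric, which is the typical length of an internally sampled step.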

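The decisive operation for injected solutions is given in Equation (3) of
Figure 1. Their Mahalanobis distance to the distribution mean is clipped at
$c_y \approx \sqrt{n} + 2$, preventing artificially long steps from entering the
adaptation procedure. Additionally, but in most cases rather irrelevant after
clipping the single steps, $\Delta_\sigma^{\max}$ (see Table 1) keeps possible
step-size changes below the factor $\exp(0.6) \approx 1.82$. Otherwise, the
depicted algorithm is not further modified (unless $c_y^m$ is set $< \infty$).
We also use the original internal strategy parameters for CMA-ES, which seems
particularly reasonable if only a smaller fraction of internal solutions is
replaced in (1).

¹ The symmetric Cholesky factor $C^{1/2}$ does not supply a rotated coordinate
system as desired for adaptive encoding. In this case, we sample using
$BD\,\mathcal{N}(0, I) \sim C^{1/2} \mathcal{N}(0, I)$, where
$BD : \mathbb{R}^n \to \mathbb{R}^n$ is the linear decoding.

\begin{align}
x_i &= m^t + \sigma^t \times (C^t)^{1/2} \mathcal{N}(0, I) \quad \text{for } i = 1, \dots, \lambda \tag{1} \\
y_i &= \frac{x_{i:\lambda} - m^t}{\sigma^t} \quad \text{where } f(x_{1:\lambda}) \le \dots \le f(x_{\mu:\lambda}) \le f(x_{\mu+1:\lambda}) \dots \tag{2} \\
y_i &\leftarrow \alpha_{\mathrm{clip}}\big(c_y, \|(C^t)^{-1/2} y_i\|\big) \times y_i \quad \text{if } x_{i:\lambda} \text{ was injected} \tag{3} \\
\Delta m &= \begin{cases} \dfrac{x_m - m^t}{\sigma^t} & \text{if } x_m \text{ was injected} \\ \sum_{i=1}^{\mu} w_i y_i & \text{otherwise} \end{cases} \tag{4} \\
m^{t+1} &= m^t + c_m \sigma^t \Delta m \tag{5} \\
\Delta m &\leftarrow \alpha_{\mathrm{clip}}\big(c_y^m, \sqrt{\mu_w}\, \|(C^t)^{-1/2} \Delta m\|\big) \times \Delta m \quad \text{if } x_m \text{ was injected or } \dots \tag{6} \\
p_\sigma^{t+1} &= (1 - c_\sigma)\, p_\sigma^t + \sqrt{c_\sigma (2 - c_\sigma)\, \mu_w}\; (C^t)^{-1/2} \Delta m \tag{7} \\
h_\sigma &= \begin{cases} 1 & \text{if } \|p_\sigma^{t+1}\|^2 < n \big(1 - (1 - c_\sigma)^{2(t+1)}\big) \big(2 + 4/(n+1)\big) \\ 0 & \text{otherwise} \end{cases} \tag{8} \\
p_c^{t+1} &= (1 - c_c)\, p_c^t + h_\sigma \sqrt{c_c (2 - c_c)\, \mu_w}\; \Delta m \tag{9} \\
C^{t+1} &= (1 - c_1' - c_\mu)\, C^t + c_1 \underbrace{p_c^{t+1} {p_c^{t+1}}^{\mathsf{T}}}_{\text{rank one update}} + c_\mu \sum_{i=1}^{\mu} w_i y_i y_i^{\mathsf{T}} \tag{10} \\
\sigma^{t+1} &= \sigma^t \times \exp\left( \Delta_\sigma^{\max} \wedge \frac{c_\sigma}{d_\sigma} \left( \frac{\|p_\sigma^{t+1}\|}{E\|\mathcal{N}(0, I)\|} - 1 \right) \right) \tag{11}
\end{align}

Figure 1: Update equations for the state variables in the
$(\mu/\mu_w, \lambda)$-CMA-ES with iteration index $t = 0, 1, 2, \dots$ and
$m^t \in \mathbb{R}^n$, $\sigma^t \in \mathbb{R}_+$,
$C^t \in \mathbb{R}^{n \times n}$ positive definite,
$p_\sigma^t, p_c^t \in \mathbb{R}^n$ and $p_\sigma^{t=0} = p_c^{t=0} = 0$,
$C^{t=0} = I$ and parameters taken from Table 1. We have additionally
$c_1' = c_1 (1 - (1 - h_\sigma^2)\, c_c (2 - c_c))$. The chosen ordering of
equations allows the time index to be removed. The symbol $x_{i:\lambda}$ is the
$i$-th best of the solutions $x_1, \dots, x_\lambda$. The "optimal" ordering of
(3), (4), (5) and (6) is an open issue.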
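The update equations of Figure 1 can be sketched in Python as below, for the case where single solutions (but no mean $x_m$) are injected. This is a minimal sketch under stated assumptions, not the author's implementation: all function and dictionary names are hypothetical, the log-decreasing recombination weights are an assumed standard choice, the remaining defaults follow the values commonly used with CMA-ES and given in Table 1, and $\Delta_\sigma^{\max} = 1$ is used.

```python
import numpy as np

def default_parameters(n):
    """Default parameters; the weights w_i are an assumption (log-decreasing)."""
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log(mu + 1) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_w = 1.0 / np.sum(w ** 2)
    alpha_cov = 2.0
    c_sig = (mu_w + 2) / (n + mu_w + 3)
    d_sig = 1 + c_sig + 2 * max(0.0, np.sqrt((mu_w - 1) / (n + 1)) - 1)
    c_1 = alpha_cov * min(1, lam / 6) / ((n + 1.3) ** 2 + mu_w)
    rank_mu = alpha_cov * (mu_w - 2 + 1 / mu_w) / ((n + 2) ** 2 + alpha_cov * mu_w / 2)
    return dict(lam=lam, mu=mu, w=w, mu_w=mu_w, c_m=1.0, c_sig=c_sig, d_sig=d_sig,
                c_c=4 / (n + 4), c_1=c_1, c_mu=min(1 - c_1, rank_mu),
                c_y=np.sqrt(n) + 2 * n / (n + 2), dsig_max=1.0)

def alpha_clip(c, x):
    """Eq. (12): min(1, c / x), written so that x = 0 is harmless."""
    return 1.0 if x <= c else c / x

def cma_iteration(f, m, sigma, C, p_sig, p_c, t, p, rng, injected=()):
    """One pass through (1)-(11); injected solutions replace the first samples."""
    n = len(m)
    eigvals, B = np.linalg.eigh(C)
    sqrt_C = (B * np.sqrt(eigvals)) @ B.T
    inv_sqrt_C = (B / np.sqrt(eigvals)) @ B.T
    # (1) sample lambda solutions, then overwrite some of them with injections
    X = m + sigma * rng.standard_normal((p["lam"], n)) @ sqrt_C
    was_injected = np.zeros(p["lam"], dtype=bool)
    for k, x in enumerate(injected):
        X[k], was_injected[k] = x, True
    # (2) rank by f and keep the steps of the mu best solutions
    order = np.argsort([f(x) for x in X])[:p["mu"]]
    Y = (X[order] - m) / sigma
    # (3) clip the Mahalanobis length of injected steps at c_y
    for i, idx in enumerate(order):
        if was_injected[idx]:
            Y[i] *= alpha_clip(p["c_y"], np.linalg.norm(inv_sqrt_C @ Y[i]))
    # (4), (5) weighted recombination and mean update
    dm = p["w"] @ Y
    m = m + p["c_m"] * sigma * dm
    # (7), (8) evolution path for sigma and the indicator h_sigma
    cs, cc, mu_w = p["c_sig"], p["c_c"], p["mu_w"]
    p_sig = (1 - cs) * p_sig + np.sqrt(cs * (2 - cs) * mu_w) * (inv_sqrt_C @ dm)
    h = float(p_sig @ p_sig < n * (1 - (1 - cs) ** (2 * (t + 1))) * (2 + 4 / (n + 1)))
    # (9), (10) evolution path for C, rank-one and rank-mu update
    p_c = (1 - cc) * p_c + h * np.sqrt(cc * (2 - cc) * mu_w) * dm
    c_1_prime = p["c_1"] * (1 - (1 - h ** 2) * cc * (2 - cc))
    C = ((1 - c_1_prime - p["c_mu"]) * C + p["c_1"] * np.outer(p_c, p_c)
         + p["c_mu"] * (Y.T * p["w"]) @ Y)
    # (11) step-size update with the increment limited by dsig_max
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    exponent = cs / p["d_sig"] * (np.linalg.norm(p_sig) / chi_n - 1)
    sigma = sigma * np.exp(min(p["dsig_max"], exponent))
    return m, sigma, C, p_sig, p_c

def sphere(x):
    return float(np.sum(np.asarray(x) ** 2))

rng = np.random.default_rng(1)
n = 5
p = default_parameters(n)
m, sigma, C = 3.0 * np.ones(n), 1.0, np.eye(n)
p_sig, p_c = np.zeros(n), np.zeros(n)
for t in range(200):
    good = 1e-3 * rng.standard_normal(n)  # a slightly perturbed optimum
    m, sigma, C, p_sig, p_c = cma_iteration(sphere, m, sigma, C, p_sig, p_c, t, p, rng, [good])
print(sphere(m))
```

Calling `cma_iteration` with an empty `injected` argument gives the unmodified update, since (3) then never applies.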



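Table 1: Default parameter values of $(\mu/\mu_w, \lambda)$-CMA-ES taken from
[ ], where by definition $\sum_{i=1}^{\mu} |w_i| = 1$ and
$\mu_w^{-1} = \sum_{i=1}^{\mu} w_i^2$, and
$a - b \wedge c + d := \min(a - b, c + d)$. Only the population size $\lambda$
is possibly left to the user's choice (see also [ ]).

\begin{align*}
\lambda &= 4 + \lfloor 3 \ln n \rfloor \\
\mu &= \lfloor \lambda / 2 \rfloor \\
c_y &= \sqrt{n} + 2n/(n + 2) \\
c_y^m &= \sqrt{2n} + 2n/(n + 2) \\
c_m &= 1 \\
c_\sigma &= \frac{\mu_w + 2}{n + \mu_w + 3} \\
d_\sigma &= 1 + c_\sigma + 2 \max\left(0, \sqrt{\frac{\mu_w - 1}{n + 1}} - 1\right) \\
c_c &= \frac{4 + 0 \times \mu_w / n}{n + 4 + 0 \times 2 \mu_w / n} \\
c_1 &= \frac{\alpha_{\mathrm{cov}} \min(1, \lambda / 6)}{(n + 1.3)^2 + \mu_w} \\
c_\mu &= 1 - c_1 \wedge \alpha_{\mathrm{cov}} \frac{\mu_w - 2 + 1/\mu_w}{(n + 2)^2 + \alpha_{\mathrm{cov}}\, \mu_w / 2}
\end{align*}

$\alpha_{\mathrm{cov}} = 2$ could be chosen $< 2$, e.g. $\alpha_{\mathrm{cov}} = 0.5$ for noisy problems
$\Delta_\sigma^{\max} = 1$ or even $0.6$

Strong injection: mean shift. If we want to make a strong impact with an
injection, we can shift the mean. We compute

$$\Delta m = \frac{x_m - m^t}{\sigma^t} \tag{15}$$

from the injected solution $x_m$ as in (4). When no further solutions $x_i$ are
used, the remaining update equations can be performed with $c_\mu = 0$. With
$c_m = 1$ (the default), $m^{t+1} = x_m$. In order to prevent an unrealistically
large shift of $m^t$ in (5), we might exchange the order of (5) and (6),
therefore applying the length adjustment for $\Delta m$ in (6) before the actual
mean shift (5).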
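The mean-shift injection with the order of (5) and (6) exchanged can be sketched as follows. The function name and its signature are hypothetical, and the numbers in the usage part are arbitrary illustration values rather than recommended settings.

```python
import numpy as np

def mean_shift_injection(m, sigma, C, x_m, mu_w, c_y_m, c_m=1.0):
    """Return (new mean, clipped delta_m) for an injected mean x_m.

    Computes (15), then applies the length adjustment (6) before the shift (5).
    """
    eigvals, B = np.linalg.eigh(C)
    inv_sqrt_C = (B / np.sqrt(eigvals)) @ B.T
    delta_m = (x_m - m) / sigma                                   # (15)
    length = np.sqrt(mu_w) * np.linalg.norm(inv_sqrt_C @ delta_m)
    if length > c_y_m:
        delta_m = (c_y_m / length) * delta_m                      # (6)
    return m + c_m * sigma * delta_m, delta_m                     # (5)

m0 = np.zeros(3)
C0 = np.diag([1.0, 4.0, 9.0])
near, _ = mean_shift_injection(m0, 1.0, C0, np.array([0.1, 0.0, 0.0]), mu_w=3.0, c_y_m=4.0)
far, dm = mean_shift_injection(m0, 1.0, C0, np.array([100.0, 0.0, 0.0]), mu_w=3.0, c_y_m=4.0)
print(near)  # a nearby x_m is adopted unchanged as the new mean
print(far)   # a distant x_m only moves the mean by the clipped step
```

The clipped `delta_m` returned here is the one that would enter the remaining update equations (7) to (11).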

Parameter setting. The setting of $c_y \approx \sqrt{n} + 2$ is motivated in
Figure 2. The figure depicts the relative deviation of $\|(C^t)^{-1/2} y_i\|$
from its expected value. Given its original distribution from CMA-ES, less than
10% of the $y_i$ in (3) are actually clipped. For $n > 10$, the fraction is
smaller than 1%.
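The clipped fraction can be estimated by simulation. The sketch below assumes neutral selection without injection, so that $\|(C^t)^{-1/2} y_i\|$ is the norm of a standard normal vector, and uses $c_y = \sqrt{n} + 2n/(n+2)$ as in Figure 2. The function name, sample size and seed are illustrative choices.

```python
import numpy as np

def clipped_fraction(n, samples=100_000, seed=0):
    """Monte Carlo estimate of P(||N(0,I)|| > c_y) in dimension n."""
    rng = np.random.default_rng(seed)
    lengths = np.linalg.norm(rng.standard_normal((samples, n)), axis=1)
    c_y = np.sqrt(n) + 2 * n / (n + 2)
    return float(np.mean(lengths > c_y))

for n in (2, 5, 20, 40):
    # dimension and estimated fraction of internally sampled steps above c_y
    print(n, clipped_fraction(n))
```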

The typical length of $\sqrt{\mu_w}\, \Delta m$ depends on $\mu_w$ and is often
essentially larger than $\sqrt{n}$. Therefore the setting
$c_y^m = \sqrt{n} + 2$ leads to a visible impairment of the otherwise unmodified
CMA-ES. This suggests that $c_y^m \approx \sqrt{2n} + 2$ could be a reasonable
choice; however, the setting of $c_y^m$ still needs further empirical
validation.

In principle, the order of Equations (3), (4), (5) and (6) can be changed
under the constraint that the computation of $\Delta m$ in (4) is done before
$\Delta m$ is used in (5) and (6). More specifically, four variants are
available, implied by the exchange of (3) and (4), or (5) and (6), respectively
(another variant that uses unclipped $y_i$ for $m^{t+1}$ but clipped ones for
$\Delta m$ in the further computations is possible, however not by simple
exchange of equations). The variants differ in whether $\Delta m$ is computed
from clipped $y_i$ and whether $\Delta m$ itself is clipped before or after it
is used to compute $m^{t+1}$. All these variations seem feasible, because an
unconstrained shift of $m^{t+1}$ is per se not critical for the algorithm
behavior.

Figure 2: Relative deviation of $\|C^{-1/2} y_i\|$ (top) and
$\|C^{-1/2} y_i\|^2$ (bottom) from its expected value plotted versus dimension.
All values are normalized as $x \mapsto x / E\|\mathcal{N}(0, I)\| - 1$ (top)
and $x^2 \mapsto (x^2/n - 1)/2$ (bottom), compare also (11) for
$c_\sigma = d_\sigma = 1$. Plotted are statistics of the random variable $x$
(top) and $x^2$ (bottom), where $x^2$ follows a chi-square distribution with $n$
degrees of freedom, like $\|C^{-1/2} y_i\|^2$ does without injections under
neutral selection. Plotted against $n$ are the modal value ($x = \sqrt{n - 1}$
and $x^2 = n - 2$ respectively, as dots), the approximation of the expected
value $\sqrt{n}$ and the expected value $n$ respectively (thin solid), the 1,
10, 50, 90, and 99%tile (dashed) and $x = c_y = \sqrt{n} + 2n/(n + 2)$.



3 Discussion 

All update equations starting from (4) are formulated relative to the original
sample distribution. This means we are, in principle, free to change the
distribution before each iteration step. Many reasonable adjustments are
possible: a mean shift² (injecting $x_m$ resembles an arbitrary mean shift with
additional further updates based on this mean shift), changing the step-size
$\sigma$, increasing small variances in $C$, and so on. The modification advised
in this report is necessary if $x_i$ is not in accordance with the distribution
in (1).

With the introduced modification(s) the CMA-ES can also be used in the
adaptive encoding context (however, using $\sigma$ for the encoding-decoding
might only turn out to be useful if the encoding is an affine linear
transformation). In the original adaptive encoding [ ], different normalization
measures have been taken for the cumulation in $p_c$ and for the covariance
matrix update, and the step-size adaptation has been entirely omitted. In this
report, by default the tight normalization of the single steps is the only
measure (unless $x_m$ is injected). The new normalization replaces the
multiplication of the single steps with $\alpha_i$ in [ ] in the covariance
matrix update. The new normalization is tighter: choosing $c_y \approx \dots$
instead of $c_y \approx \sqrt{n} + 2$ would be comparable to [ ]. The setting
therefore allows step-size control to be applied reliably. However, the new
setting is less tight for the mean step, as $c_y^m = \sqrt{n}$ (without taking a
minimum in (6)) would be comparable to [ ], while we now use $c_y^m = \infty$
unless an explicit mean shift is performed. This setting might fail if all new
points point into the same direction viewed from $m^t$ (suggesting
$c_y^m \approx \dots$ as a compromise). The new setting seems to be slightly
simpler and might turn out preferable also in the adaptive encoding setting,
even when leaving aside step-size adaptation.

Due to the minor modifications we do not expect an adverse interference 
with negative updates of the covariance matrix as in active CMA-ES [7, 5]. 
On the contrary, limiting the length of steps that enter the negative update 
mitigates a principal design flaw of negative updates: long steps tend to be
worse (and therefore enter the negative update with a higher probability) and 
tend to produce a stronger update effect, both just because they are long and 
not because they indicate an undesirable direction. 

Finally, it is well possible to inject the same solution several times, for
example based on its superior quality. One might, for example, consider
unconditionally (re-)injecting the best-ever solution in every iteration. Then,
an "elitist algorithm with comma selection" arises, introducing an easy and
appealing way to implement elitism in evolution strategies with weighted
multi-recombination.

A generalized approach to normalize injected solutions compares the empirical
CDF of the lengths $l_i^t = \|(C^t)^{-1/2} y_i\|$, $i = 1, \dots, \lambda$,
$t = 1, 2, \dots$, with a desired CDF, $F$, and reduces the length of $y_i$ such
that the observed relative frequency of lengths larger than or equal to $l_i^t$
is below, say, $1.2\,(1 - F(l_i^t))$. In (12), the desired CDF is very crudely
chosen to be $F(x) = 1_{x < c_y}$. The desired operation in theory is
$l^t \leftarrow F_{\mathrm{desired}}^{-1}(F_{\mathrm{true}}(l^t))$. A
practicable implementation could compare
$z^t = \sqrt{2} \times (\|(C^t)^{-1/2} y_i\| - E\|\mathcal{N}(0, I)\|)$ with the
standard normal distribution $\mathcal{N}$, in that a correction is applied if
$z^t > 1$ and the observed relative frequency of values larger than or equal to
$z^t$ exceeds $1.2\,(1 - \mathcal{N}(z^t))$.

² However, a mean shift without further updates will impair the meaning of the
evolution paths.

4 Preliminary Experiments

Preliminary empirical investigations have been conducted by injecting a single
slightly perturbed optimal solution. This is virtually the best case scenario
when the distribution mean $m$ is far away from the optimum. It becomes the
worst case scenario when $m$ is closer to the optimum than the perturbation.
Single runs on the sphere function and the Rosenbrock function are shown in
Figures 3 and 4. The upper black graph depicts the worst iteration-wise function
value and reveals the (sharp) transition between best and worst case scenario
by showing convergence first and stagnation afterwards.

The improvement on the sphere function is limited to a factor of about two,
namely due to the maximal iteration-wise step-size decrement. This limit can be
exceeded by additionally decreasing $\sigma$ when the injected solution is
trustworthy, has a good quality, and is close to $m$ (in the norm defined by
$\sigma^2 C$). The precise implementation (the question of what is close to $m$)
might also depend on … As to be expected, the effect of injecting single bad
solutions (worst case scenario in the later stage) is negligible.

The improvement on the Rosenbrock function exceeds our expectation: we 
see a speed-up by a factor of almost n, simply because the speed is similar to the 
one on the sphere function with injection. Again, this speed-up can be further 
enhanced by step-size decrements. 

Experiments for an injected mean-shift have not been conducted yet. 

Experiments that always inject the best-ever solution reveal a moderate
performance impairment when searching multimodal landscapes.

5 Further Considerations 

Another case of application is temporary freezing of some variables (coordinates)
to the same value in all candidate solutions. (This decreases the length of the
step in the Euclidean norm, but due to correlations in the distribution this can
lead to exceptionally long steps in Mahalanobis distance even if the frozen value
is borrowed from $m$.) In this case, it is also advisable to slightly modify the
step-size equations (8) and (11). Given that $j$ variables are frozen, these
variables are not taken into account for computing $\|p_\sigma^{t+1}\|$ and
consequently $n - j$ is used instead of $n$ in (8) and
$E\|\mathcal{N}(0, I)\|$ is computed for $n - j$ dimensions in (11). After one
iteration, the respective components of $y_i$ will be zero (given $c_m = 1$)
and also $c_y$ should be set as for dimension $n - j$. In principle, all
parameters from Table 1 can then be set as for dimension $n - j$. Additionally,
in order to avoid numerical problems, the diagonal elements of the frozen
coordinates of the covariance matrix should be kept at least in the order of the
smallest eigenvalue.
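The adjustment of the step-size quantities for frozen coordinates can be sketched as follows. This is an illustrative reading of the paragraph above rather than a tested recipe. The function name and the boolean mask `frozen` are hypothetical.

```python
import numpy as np

def step_size_quantities_with_frozen(p_sigma, frozen):
    """Quantities for (8) and (11) when some coordinates are frozen.

    Returns the squared path length over the free coordinates, the effective
    dimension n - j, and the approximation of E||N(0,I)|| in n - j dimensions.
    """
    free = ~np.asarray(frozen, dtype=bool)
    n_eff = int(free.sum())
    p_free = np.asarray(p_sigma, dtype=float)[free]
    chi_n_eff = np.sqrt(n_eff) * (1 - 1 / (4 * n_eff) + 1 / (21 * n_eff ** 2))
    return float(p_free @ p_free), n_eff, chi_n_eff

# two of four coordinates frozen: only the first two enter the path length
sq_len, n_eff, chi_n_eff = step_size_quantities_with_frozen(
    [3.0, 4.0, 7.0, -2.0], [False, False, True, True])
print(sq_len, n_eff, chi_n_eff)
```

The returned values would replace $\|p_\sigma^{t+1}\|^2$, $n$ and $E\|\mathcal{N}(0, I)\|$ in (8) and (11).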

6 Summary and Conclusion 

Using candidate proposals in the CMA-ES that do not directly stem from the
sample distribution of CMA-ES can often lead to a failure of the algorithm.
The effective countermeasures, however, turn out to be comparatively simple:
only the appearance of large steps needs to be tightly controlled, where large is
defined w.r.t. the original sample distribution. The possibility to inject any
candidate solution is valuable in many situations. In case of bounds or
constraints where a repair mechanism is available, this might serve as basis for
a new class of well-performing constraint handling mechanisms.

Figure 3: Runs of CMA-ES on the sphere function, middle and lower row with
a single injected solution distributed as $10^{\dots} \times \mathcal{N}(0, I)$
(which corresponds in the beginning to a slightly disturbed gradient direction).
Black lines show the evolution of median and worst solution. The evolution of
the worst solution indicates that, with default parameter setting on the sphere
function, a speed-up by a factor of two can be achieved. The reason for the
comparatively moderate speed-up is that the step-size decrease per iteration is
limited. With reduced population size (lower row) the speed-up increases because
the number of iterations to reach function value $10^{-\dots}$ in the best case
scenario remains almost constant.

Figure 4: Runs of CMA-ES on the Rosenbrock function, middle and lower row
with a single injected solution distributed as
$1 + 10^{-\dots} \times \mathcal{N}(0, I)$, lower row without the clipping in
(3). With injection, the convergence speed is again twice as fast as on the
sphere function. Where the injected solutions are useful (for function values
larger than about $10^{\dots}$), the algorithm is almost $n$ times faster than
without injection (600 vs 5000 and 2000 vs 70000 evaluations in 10- and 40-D).
Without clipping, the run does not fail only because the step-size increment is
limited to $\exp(\Delta_\sigma^{\max}) = 2.718\ldots$ per iteration.